Refactor distinct aggregate implementations to use common buffer #18348
Conversation
It would be nice if I could pull PrimitiveDistinctCountAccumulator into the deduplication as well; however, it is specialized for types that don't need to hash through Hashable (i.e. non-float types), and I think there might be a performance hit if I try to force them to use Hashable 🤔
Yeah, we definitely don't want to be hashing if we can avoid that
/// `merge_batch` and a `Vec` of `ArrayRef` that are converted to scalar values
/// in the final evaluation step so that we avoid expensive conversions and
/// allocations during `update_batch`.
pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> {
Main implementation here. I toyed with the idea of making this implement Accumulator and having the different functions (like median and percentile_cont) provide their evaluate logic as a closure, but it got a bit messy. So for now they delegate their state/update_batch/merge_batch to this inner struct, which lets them grab the final set of distinct values and do their own evaluate.
alamb left a comment
Thank you @Jefffrey -- this is really quite elegant. I am sorry it took so long to review.
The only thing I think we need to do is ensure this doesn't have any impact on performance (I don't expect that it will, but want to be sure).
Really nice 🏆
    self.values.extend(arr.iter().flatten().map(Hashable));
} else {
    self.values
        .extend(arr.values().iter().cloned().map(Hashable));
nice -- this is an elegant way to special case nulls/non nulls
🤖: Benchmark completed
The clickbench QQuery0 (I believe it's this query?) that is 1.18x slower doesn't use distinct, so I don't think it's an actual slowdown.
Which issue does this PR close?
- `distinct` usage for aggregate expressions #2406

Rationale for this change
Make it easier to write distinct variations of aggregate functions by refactoring some of the common code; specifically, how they maintain the complete set of distinct primitive values, as this code was duplicated across different functions.
What changes are included in this PR?
Introduce a new `GenericDistinctBuffer`, which has methods similar to `Accumulator` to manage an internal `HashSet` of values, so implementations like `percentile_cont` and `sum` can use it internally and only implement their own evaluate functions.

Are these changes tested?
Existing tests.
Are there any user-facing changes?
No.